
Dev/use vllm #1053


Merged
merged 7 commits into FunAudioLLM:dev/Comet on Mar 13, 2025

Conversation

qi-hua

@qi-hua qi-hua commented Mar 7, 2025

This PR uses vLLM's AsyncLLMEngine to accelerate inference.

It adds cosyvoice.llm.llm_vllm.VllmQwen2LM; the other files only receive minor changes.

VllmQwen2LM currently supports multi-task inference; concurrent use requires some adaptation of the original interface.

With TRT enabled, the accelerated RTF reaches 0.1-0.15.

qi-hua added 6 commits March 7, 2025 20:26
- Add an asynchronous inference mechanism based on a queue and a background thread
- Rework the synchronous inference interface to use the new mechanism
- Remove the async_llm_inference method from the LLM class
- The method was unused, and running it outside loop_thread crashes vLLM, so it has been removed
- Add speed_test.ipynb for benchmarking the CosyVoice2 model
- It covers the test environment setup, default usage examples, and the steps to accelerate LLM inference with vLLM
- Remove the task queue and the single-task-at-a-time limitation
- Run inference tasks on the background thread via asyncio.run_coroutine_threadsafe()
@wang-TJ-20

wang-TJ-20 commented Mar 8, 2025

@qi-hua Hi, thanks for sharing. I'm trying this branch: how much GPU memory does it need, and where in the code should gpu_memory_utilization be set?
[screenshot]

@qi-hua
Author

qi-hua commented Mar 8, 2025

@qi-hua Hi, thanks for sharing. I'm trying this branch: how much GPU memory does it need, and where in the code should gpu_memory_utilization be set?

vLLM needs roughly 3-4 GB of GPU memory. gpu_memory_utilization is currently set in ENGINE_ARGS at cosyvoice/llm/llm_vllm.py:39; it is not exposed as a user-facing option yet, so you have to edit it by hand.
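(For orientation only, here is a rough sketch of what an ENGINE_ARGS constant with gpu_memory_utilization might look like; the model path and numeric values are placeholders, not the branch's actual settings. The field names follow vLLM's AsyncEngineArgs.)

from vllm import AsyncEngineArgs

# Illustrative only: the real ENGINE_ARGS at cosyvoice/llm/llm_vllm.py:39 may use different fields/values.
ENGINE_ARGS = AsyncEngineArgs(
    model='pretrained_models/CosyVoice2-0.5B',  # assumed path to the LLM weights
    gpu_memory_utilization=0.2,                 # fraction of total GPU memory vLLM may reserve
    dtype='float16',
    max_model_len=1024,
)

# The async engine is then built from these args, e.g.:
# llm_engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)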

- In the Frontend, restore the original one-by-one generation of text tokens
- In the Model class, remove unneeded logging and assertions, simplifying text token handling
@wang-TJ-20

@qi-hua Hi, thanks for sharing. I'm trying this branch: how much GPU memory does it need, and where in the code should gpu_memory_utilization be set?

vLLM needs roughly 3-4 GB of GPU memory. gpu_memory_utilization is currently set in ENGINE_ARGS at cosyvoice/llm/llm_vllm.py:39; it is not exposed as a user-facing option yet, so you have to edit it by hand.

Got it, thanks! One more thing: when I ran the test script below, I hit the following error:
[screenshot]
[screenshot]
Changing the call to the form below, as the error message suggests, fixed it. Does the vLLM implementation start multiple processes?
[screenshot]

@qi-hua
Author

qi-hua commented Mar 9, 2025

I'm not familiar with the difference between the two approaches, but by default it does start quite a few processes.

@aluminumbox
Collaborator

@lyblsgo Please take a look at this code.

@lyblsgo lyblsgo merged commit 00b454c into FunAudioLLM:dev/Comet Mar 13, 2025
@deyituo

deyituo commented Mar 13, 2025

@qi-hua On dev/use_vllm, loading the vLLM model as in speed_test.ipynb still doesn't run for me. async_cosyvoice does work.
@wang-TJ-20 Did you change any code?

@wang-TJ-20

wang-TJ-20 commented Mar 13, 2025

@qi-hua On dev/use_vllm, loading the vLLM model as in speed_test.ipynb still doesn't run for me. async_cosyvoice does work. @wang-TJ-20 Did you change any code?

I didn't change any code.
Steps:
1. Install the environment from requirements_vllm.txt; I recommend a fresh conda env, installing exactly what is listed there.
2. In speed_test.ipynb, run the code block below to register the model class.
[screenshot]
3. Copy the config files from the async_cosyvoice repo (see screenshot) into the CosyVoice 2.0 model weights folder.
[screenshot]
4. Call it with the code below. I use my own spk_id ('girl'); choose your own, or pass prompt_speech_16k directly.
def main():
    # initialize the model
    cosyvoice = CosyVoice2('CosyVoice2-0.5B',
                           load_jit=False,
                           load_trt=True,
                           fp16=True,
                           use_vllm=True)

    # load the prompt speech
    prompt_speech_16k = load_wav("girl_cut.wav", 16000)
    text = "今天天气不错"
    for _ in range(20):
        time1 = time.time()
        audio_list = []
        # for i, j in enumerate(cosyvoice.inference_instruct2("今天天气不错", '以悲伤的情感说', prompt_speech_16k, stream=True)):
        for i, j in enumerate(cosyvoice.inference_instruct2_by_spk_id(text, "以悲伤的情感说", 'girl', stream=True)):
            if i == 0:
                logging.info(f"首包耗时: {time.time() - time1}")
            audio_list.append(j['tts_speech'])
        full_tts = torch.cat(audio_list, dim=1)
        torchaudio.save('instruct.wav', full_tts, cosyvoice.sample_rate)

if __name__ == '__main__':
    main()

@deyituo

deyituo commented Mar 13, 2025

(quoting @wang-TJ-20's steps and test script above)

It runs now with these steps. This branch still needs some polishing, though; the code differs a bit from the main branch.

import time
import logging
import torch
import torchaudio

import sys
sys.path.append('third_party/Matcha-TTS')

from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

prompt_text = '希望你以后能够做得比我还好哟'
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

def main():
    cosyvoice = CosyVoice2(
        './pretrained_models/CosyVoice2-0.5B', 
        load_jit=False, 
        load_trt=True, 
        fp16=True, 
        use_vllm=True,
    )
    
    logging.info(f"\n\ninference_zero_shot")    
    for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', prompt_text, prompt_speech_16k, stream=False)):
        torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    
    time1 = time.time()    
    for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', prompt_text, prompt_speech_16k, stream=True)):
        if i == 0:
            logging.info(f"首包耗时: {time.time() - time1}")
        torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
        
    logging.info(f"\n\ninference_zero_shot + bistream")
    def text_generator():
        yield '收到好友从远方寄来的生日礼物,'
        yield '那份意外的惊喜与深深的祝福'
        yield '让我心中充满了甜蜜的快乐,'
        yield '笑容如花儿般绽放。'
    for i, j in enumerate(cosyvoice.inference_zero_shot(text_generator(), prompt_text, prompt_speech_16k, stream=False)):
        torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

    def text_generator():
        yield '收到好友从远方寄来的生日礼物,'
        yield '那份意外的惊喜与深深的祝福'
        yield '让我心中充满了甜蜜的快乐,'
        yield '笑容如花儿般绽放。'
    time1 = time.time()        
    for i, j in enumerate(cosyvoice.inference_zero_shot(text_generator(), prompt_text, prompt_speech_16k, stream=True)):
        if i == 0:
            logging.info(f"首包耗时: {time.time() - time1}")
        torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

    # instruct usage
    logging.info(f"\n\ninstruct2 usage")
    for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '用四川话说这句话', prompt_speech_16k, stream=False)):
        torchaudio.save('instruct2_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

    # instruct usage
    time1 = time.time()    
    for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '用四川话说这句话', prompt_speech_16k, stream=True)):
        if i == 0:
            logging.info(f"首包耗时: {time.time() - time1}")
        torchaudio.save('instruct2_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

        
    
if __name__ == "__main__":
    main()

2025-03-13 16:12:48,748 INFO synthesis text 收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。
2025-03-13 16:12:50,609 INFO yield speech len 14.44, rtf 0.1288315761122347
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.57s/it]
0%| | 0/1 [00:00<?, ?it/s]2025-03-13 16:12:50,723 INFO synthesis text 收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。
2025-03-13 16:12:51,068 INFO yield speech len 1.84, rtf 0.18700866595558496
2025-03-13 16:12:51,068 INFO 首包耗时: 0.40618014335632324
2025-03-13 16:12:51,334 INFO yield speech len 2.0, rtf 0.12627971172332764
2025-03-13 16:12:51,592 INFO yield speech len 2.0, rtf 0.12179696559906006
2025-03-13 16:12:51,956 INFO yield speech len 2.0, rtf 0.17377448081970215
2025-03-13 16:12:52,246 INFO yield speech len 2.0, rtf 0.1356205940246582
2025-03-13 16:12:52,521 INFO yield speech len 2.0, rtf 0.13052964210510254
2025-03-13 16:12:52,708 INFO yield speech len 1.32, rtf 0.12939536210262415
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.06s/it]
2025-03-13 16:12:52,727 INFO

inference_zero_shot + bistream
2025-03-13 16:12:52,729 INFO get tts_text generator, will skip text_normalize!
0%| | 0/1 [00:00<?, ?it/s]2025-03-13 16:12:52,729 INFO get tts_text generator, will return _extract_text_token_generator!
2025-03-13 16:12:52,786 INFO synthesis text <generator object main..text_generator at 0x7f9a4a1ee650>
2025-03-13 16:12:54,319 INFO yield speech len 14.0, rtf 0.10952198505401611
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.63s/it]
2025-03-13 16:12:54,363 INFO get tts_text generator, will skip text_normalize!
0%| | 0/1 [00:00<?, ?it/s]2025-03-13 16:12:54,363 INFO get tts_text generator, will return _extract_text_token_generator!
2025-03-13 16:12:54,418 INFO synthesis text <generator object main..text_generator at 0x7f9a4a1ee730>
2025-03-13 16:12:54,826 INFO yield speech len 1.84, rtf 0.22185600322225818
2025-03-13 16:12:54,826 INFO 首包耗时: 0.46564769744873047
2025-03-13 16:12:55,056 INFO yield speech len 2.0, rtf 0.10754311084747314
2025-03-13 16:12:55,308 INFO yield speech len 2.0, rtf 0.11862730979919434
2025-03-13 16:12:55,575 INFO yield speech len 2.0, rtf 0.12509751319885254
2025-03-13 16:12:55,958 INFO yield speech len 2.0, rtf 0.18403804302215576
2025-03-13 16:12:56,214 INFO yield speech len 2.0, rtf 0.11977958679199219
2025-03-13 16:12:56,394 INFO yield speech len 0.28, rtf 0.5912005901336669
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.04s/it]
2025-03-13 16:12:56,408 INFO

instruct2 usage
0%| | 0/1 [00:00<?, ?it/s]2025-03-13 16:12:56,471 INFO synthesis text 收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。
2025-03-13 16:12:57,551 INFO yield speech len 10.56, rtf 0.10220731298128763
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.17s/it]
0%| | 0/1 [00:00<?, ?it/s]2025-03-13 16:12:57,648 INFO synthesis text 收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。
2025-03-13 16:12:57,965 INFO yield speech len 1.84, rtf 0.17231897167537522
2025-03-13 16:12:57,965 INFO 首包耗时: 0.3753683567047119
2025-03-13 16:12:58,197 INFO yield speech len 2.0, rtf 0.10890007019042969
2025-03-13 16:12:58,452 INFO yield speech len 2.0, rtf 0.11727273464202881
2025-03-13 16:12:58,821 INFO yield speech len 2.0, rtf 0.17509591579437256
2025-03-13 16:12:59,076 INFO yield speech len 2.0, rtf 0.1149369478225708
2025-03-13 16:12:59,252 INFO yield speech len 1.24, rtf 0.1286693157688264

@NeverSayXz

Could you explain why 6564 is added to both the prompt text tokens and the text tokens during vLLM inference?

@qi-hua
Author

qi-hua commented Mar 14, 2025

Could you explain why 6564 is added to both the prompt text tokens and the text tokens during vLLM inference?

It distinguishes text token ids (all text tokens are shifted up by 6564, and 6564 is subtracted again when computing inside vLLM) from speech token ids (the reference-audio tokens and the tokens the LLM produces), so that the vLLM model can map each id to the correct embedding.

See vllm_use_cosyvoice2_model.CosyVoice2Model.get_input_embeddings for the details.
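(For intuition, a minimal sketch of the offset scheme described above; the helper, embedding sizes, and names below are illustrative assumptions, not the actual get_input_embeddings implementation.)

import torch

TEXT_TOKEN_OFFSET = 6564  # text token ids are shifted past the speech-token id range

def mixed_input_embeddings(input_ids: torch.Tensor,
                           speech_embed: torch.nn.Embedding,
                           text_embed: torch.nn.Embedding) -> torch.Tensor:
    # ids <  TEXT_TOKEN_OFFSET -> speech tokens (reference audio / LLM output)
    # ids >= TEXT_TOKEN_OFFSET -> text tokens, shifted back down before lookup
    is_text = input_ids >= TEXT_TOKEN_OFFSET
    out = torch.empty(*input_ids.shape, text_embed.embedding_dim,
                      dtype=text_embed.weight.dtype, device=text_embed.weight.device)
    out[~is_text] = speech_embed(input_ids[~is_text])
    out[is_text] = text_embed(input_ids[is_text] - TEXT_TOKEN_OFFSET)
    return out

if __name__ == '__main__':
    speech_embed = torch.nn.Embedding(6564, 8)   # speech-token vocabulary (offset value from the thread)
    text_embed = torch.nn.Embedding(152000, 8)   # text vocabulary size is a placeholder
    ids = torch.tensor([[12, 6564 + 5, 300, 6564 + 9]])
    print(mixed_input_embeddings(ids, speech_embed, text_embed).shape)  # torch.Size([1, 4, 8])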

@wang-TJ-20

wang-TJ-20 commented Mar 16, 2025

@qi-hua Hi, one question: on this branch, does latency rise under multi-threaded concurrency? The bottleneck under concurrency looks like the lock around the flow module's TRT inference.
[screenshot]

  • Also, for the vLLM-accelerated LLM, is concurrency handled as a sequential queue, i.e. one request finishes before the next starts? In my tests the inference time grows roughly in proportion to the number of threads.

@qi-hua
Author

qi-hua commented Mar 16, 2025

@qi-hua Hi, one question: on this branch, does latency rise under multi-threaded concurrency? The bottleneck under concurrency looks like the lock around the flow module's TRT inference.

  • Also, for the vLLM-accelerated LLM, is concurrency handled as a sequential queue, i.e. one request finishes before the next starts? In my tests the inference time grows roughly in proportion to the number of threads.

  • The flow module was not changed; it computes exactly as before.

  • For the vLLM-accelerated LLM, the vLLM inference itself runs asynchronously and concurrently on a background thread.

  • Although out_queue.get() in VllmQwen2LM.llm_inference blocks, the whole of CosyVoiceModel.llm_job() runs in its own thread, so concurrency should still work. With a small number of requests starting at the same time, the most likely explanation is that all the LLM inference finishes within a short window and the requests then wait on the flow stage.
    [screenshot]
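(For readers unfamiliar with the pattern, a minimal self-contained sketch of the mechanism described above: a synchronous call submits an async coroutine to an event loop running in a background thread and blocks on a queue for results. It is loosely analogous to, but not, the VllmQwen2LM code.)

import asyncio
import queue
import threading

class BackgroundLLM:
    """Toy stand-in for an engine whose async generation runs on a background event loop."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        # the event loop lives in its own daemon thread (the "loop_thread")
        self.loop_thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.loop_thread.start()

    async def _generate(self, prompt: str, out_queue: queue.Queue):
        # stand-in for async token generation: push tokens to the queue as they appear
        for token in prompt.split():
            await asyncio.sleep(0)   # pretend to do async work per token
            out_queue.put(token)
        out_queue.put(None)          # sentinel: generation finished

    def llm_inference(self, prompt: str):
        """Synchronous generator facade over the async engine."""
        out_queue: queue.Queue = queue.Queue()
        asyncio.run_coroutine_threadsafe(self._generate(prompt, out_queue), self.loop)
        while (token := out_queue.get()) is not None:   # blocking get, as discussed above
            yield token

if __name__ == '__main__':
    llm = BackgroundLLM()
    print(list(llm.llm_inference('tokens stream back through the queue')))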

@jnkr36

jnkr36 commented Mar 19, 2025

@qi-hua Hello, could I ask for some help?
I pulled the dev/Comet branch to try the vLLM speed-up. Following your steps, I always hit an error when execution reaches self.llm_engine: AsyncLLMEngine = AsyncLLMEngine.from_engine_args(engine_args). Could you help me figure out the cause? Much appreciated.
Here is the code I run:

import time
import asyncio
import torchaudio

import sys
sys.path.append('third_party/Matcha-TTS')
sys.path.append("/usr/lib/python3/dist-packages")
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

prompt_text = '希望你以后能够做得比我还好哟'
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

cosyvoice = CosyVoice2(
    './pretrained_models/CosyVoice2-0.5B',
    load_jit=True,
    load_trt=True,
    fp16=True,
    use_vllm=True,
)

for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', prompt_text, prompt_speech_16k, stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

And here is the error log:

2025-03-19 01:24:57,786 - modelscope - INFO - PyTorch version 2.5.1 Found.
2025-03-19 01:24:57,787 - modelscope - INFO - Loading ast index from /home/oppoer/.cache/modelscope/ast_indexer
2025-03-19 01:24:57,808 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 2cb5b0afa0d9f3a36eaf1baa92beb149 and a total number of 980 components indexed
Sliding Window Attention is enabled but not implemented for sdpa; unexpected results may be encountered.
/home/oppoer/.local/lib/python3.10/site-packages/diffusers/models/lora.py:393: FutureWarning: LoRACompatibleLinear is deprecated and will be removed in version 1.0.0. Use of LoRACompatibleLinear is deprecated. Please switch to PEFT backend by installing PEFT: pip install peft.
deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
2025-03-19 01:25:07,184 INFO input frame rate=25
2025-03-19 01:25:09.708549881 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 8 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-03-19 01:25:09.710625810 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-03-19 01:25:09.710649943 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
text.cc: festival_Text_init
open voice lang map failed
INFO 03-19 01:25:11 init.py:207] Automatically detected platform cuda.
WARNING 03-19 01:25:12 config.py:2448] Casting torch.bfloat16 to torch.float16.
INFO 03-19 01:25:12 config.py:549] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 03-19 01:25:12 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=1024.
WARNING 03-19 01:25:12 utils.py:2128] CUDA was previously initialized. We must use the spawn multiprocessing start method. Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing for more information.

2025-03-19 01:25:13,819 - modelscope - INFO - PyTorch version 2.5.1 Found.
2025-03-19 01:25:13,819 - modelscope - INFO - Loading ast index from /home/oppoer/.cache/modelscope/ast_indexer
2025-03-19 01:25:13,841 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 2cb5b0afa0d9f3a36eaf1baa92beb149 and a total number of 980 components indexed
Sliding Window Attention is enabled but not implemented for sdpa; unexpected results may be encountered.
/home/oppoer/.local/lib/python3.10/site-packages/diffusers/models/lora.py:393: FutureWarning: LoRACompatibleLinear is deprecated and will be removed in version 1.0.0. Use of LoRACompatibleLinear is deprecated. Please switch to PEFT backend by installing PEFT: pip install peft.
deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
2025-03-19 01:25:20,808 INFO input frame rate=25
2025-03-19 01:25:23.095775818 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 8 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-03-19 01:25:23.097837734 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-03-19 01:25:23.097856527 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
text.cc: festival_Text_init
open voice lang map failed
INFO 03-19 01:25:24 init.py:207] Automatically detected platform cuda.
WARNING 03-19 01:25:25 config.py:2448] Casting torch.bfloat16 to torch.float16.
INFO 03-19 01:25:25 config.py:549] This model supports multiple tasks: {'embed', 'generate', 'classify', 'score', 'reward'}. Defaulting to 'generate'.
INFO 03-19 01:25:25 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=1024.
2025-03-19 01:25:25,898 WARNING use vllm inference failed.

    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/notebook/data/personal/tts/demo/dev-Comet-CosyVoice/vllm_inference_test.py", line 21, in
cosyvoice = CosyVoice2(
File "/home/notebook/data/personal/tts/demo/dev-Comet-CosyVoice/cosyvoice/cli/cosyvoice.py", line 172, in init
raise e
File "/home/notebook/data/personal/tts/demo/dev-Comet-CosyVoice/cosyvoice/cli/cosyvoice.py", line 169, in init
self.model = VllmCosyVoice2Model(model_dir, configs['flow'], configs['hift'], fp16)
File "/home/notebook/data/personal/tts/demo/dev-Comet-CosyVoice/cosyvoice/cli/model.py", line 424, in init
llm = VllmQwen2LM(model_dir)
File "/home/notebook/data/personal/tts/demo/dev-Comet-CosyVoice/cosyvoice/llm/llm_vllm.py", line 81, in init
self.llm_engine: AsyncLLMEngine = AsyncLLMEngine.from_engine_args(engine_args)
File "/home/oppoer/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 114, in from_engine_args
return cls(
File "/home/oppoer/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 85, in init
self.engine_core = EngineCoreClient.make_client(
File "/home/oppoer/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 61, in make_client
return AsyncMPClient(vllm_config, executor_class, log_stats)
File "/home/oppoer/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 340, in init
super().init(
File "/home/oppoer/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 220, in init
self.proc_handle = BackgroundProcHandle(
File "/home/oppoer/.local/lib/python3.10/site-packages/vllm/v1/utils.py", line 118, in init
self.proc.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Stepping through with a debugger, the problem always appears at self.llm_engine: AsyncLLMEngine = AsyncLLMEngine.from_engine_args(engine_args) in cosyvoice/llm/llm_vllm.py. The first time execution reaches that line and I single-step, it jumps back to the top of the script and re-executes everything from the beginning; when it reaches that line again it errors out, which is why the log contains two repeated segments. I don't understand why this happens. The log prints a RuntimeError hint, but I'm not sure how to fix it. Could you please take a look? Many thanks!

@qi-hua
Author

qi-hua commented Mar 19, 2025

How to use vLLM to accelerate LLM inference in CosyVoice2

1. Pass use_vllm=True when instantiating CosyVoice2.
2. Make sure CosyVoice2 is instantiated only after the main process has finished bootstrapping.

When the vLLM engine is enabled, vLLM uses multiprocessing and must be started from the main process, so CosyVoice2 has to be instantiated after the main process has finished its bootstrapping phase; the original CosyVoice2 initialization code needs a small change.

For example, if the previous code was:

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B',  load_jit=True, load_trt=True, fp16=True, use_vllm=True)

def inference():
	prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)  
	for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):  
	    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

if __name__ == '__main__':
	inference()

The code above fails at startup because vLLM launches subprocesses while CosyVoice2 is being initialized, before the main process has finished bootstrapping.
The fix is to move the initialization into the main function, so that CosyVoice2 is only instantiated after the main process has completed bootstrapping and the multiprocess vLLM engine starts in the correct context:

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice: CosyVoice2|None = None

def inference():
	prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)  
	for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):  
	    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

def main():
	global cosyvoice
	cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B',  load_jit=True, load_trt=True, fp16=True, use_vllm=True)
	# other inference tasks
	inference()

if __name__ == '__main__':
	main()

@jnkr36

@jnkr36

jnkr36 commented Mar 21, 2025

Many thanks for the reply, @qi-hua; that approach runs successfully.
Two more questions:
1. I want to wrap the pipeline above into a FastAPI service. To handle more concurrency, I tried starting the service with two workers (as I understand it, that would instantiate two CosyVoice2 objects and double the GPU memory use, which my GPU can afford). But with the workers argument the service fails to start; without it (default 1, I assume) it works. Here is the launch code:
[screenshot]

Here is the log from a call to the service. It looks like with 2 workers CosyVoice2 is never properly initialized, so the name cannot be found. How should I start two workers, and would multiple workers actually support more concurrency? Any guidance would be appreciated.
2025-03-21 03:04:21,320 - modelscope - INFO - PyTorch version 2.5.1 Found.
2025-03-21 03:04:21,321 - modelscope - INFO - Loading ast index from /home/oppoer/.cache/modelscope/ast_indexer
2025-03-21 03:04:21,342 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 c9bd43650d2dc196e6e1b33b56c12959 and a total number of 980 components indexed
failed to import ttsfrd, use WeTextProcessing instead
2025-03-21 03:04:24,341 DEBUG Starting new HTTPS connection (1): www.modelscope.cn:443
2025-03-21 03:04:24,686 DEBUG https://www.modelscope.cn:443 "GET /api/v1/models/iic/CosyVoice2-0.5B/revisions HTTP/1.1" 200 205
2025-03-21 03:04:25,148 DEBUG https://www.modelscope.cn:443 "GET /api/v1/models/iic/CosyVoice2-0.5B/repo/files?Revision=master&Recursive=True HTTP/1.1" 200 None
2025-03-21 03:04:25,202 DEBUG Starting new HTTPS connection (1): www.modelscope.cn:443
2025-03-21 03:04:25,575 DEBUG https://www.modelscope.cn:443 "GET /api/v1/models/iic/CosyVoice2-0.5B/revisions HTTP/1.1" 200 205
2025-03-21 03:04:25,982 DEBUG https://www.modelscope.cn:443 "GET /api/v1/models/iic/CosyVoice2-0.5B/repo/files?Revision=master&Recursive=True HTTP/1.1" 200 None
Sliding Window Attention is enabled but not implemented for sdpa; unexpected results may be encountered.
/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/diffusers/models/lora.py:393: FutureWarning: LoRACompatibleLinear is deprecated and will be removed in version 1.0.0. Use of LoRACompatibleLinear is deprecated. Please switch to PEFT backend by installing PEFT: pip install peft.
deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
2025-03-21 03:04:28,165 INFO input frame rate=25
2025-03-21 03:04:29.524953394 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 8 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-03-21 03:04:29.527004226 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-03-21 03:04:29.527025659 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2025-03-21 03:04:29,811 WETEXT INFO found existing fst: /opt/conda/envs/cosyvoice/lib/python3.10/site-packages/tn/zh_tn_tagger.fst
2025-03-21 03:04:29,811 INFO found existing fst: /opt/conda/envs/cosyvoice/lib/python3.10/site-packages/tn/zh_tn_tagger.fst
2025-03-21 03:04:29,811 WETEXT INFO /opt/conda/envs/cosyvoice/lib/python3.10/site-packages/tn/zh_tn_verbalizer.fst
2025-03-21 03:04:29,811 INFO /opt/conda/envs/cosyvoice/lib/python3.10/site-packages/tn/zh_tn_verbalizer.fst
2025-03-21 03:04:29,811 WETEXT INFO skip building fst for zh_normalizer ...
2025-03-21 03:04:29,811 INFO skip building fst for zh_normalizer ...
2025-03-21 03:04:30,024 WETEXT INFO found existing fst: /opt/conda/envs/cosyvoice/lib/python3.10/site-packages/tn/en_tn_tagger.fst
2025-03-21 03:04:30,024 INFO found existing fst: /opt/conda/envs/cosyvoice/lib/python3.10/site-packages/tn/en_tn_tagger.fst
2025-03-21 03:04:30,024 WETEXT INFO /opt/conda/envs/cosyvoice/lib/python3.10/site-packages/tn/en_tn_verbalizer.fst
2025-03-21 03:04:30,024 INFO /opt/conda/envs/cosyvoice/lib/python3.10/site-packages/tn/en_tn_verbalizer.fst
2025-03-21 03:04:30,024 WETEXT INFO skip building fst for en_normalizer ...
2025-03-21 03:04:30,024 INFO skip building fst for en_normalizer ...
INFO 03-21 03:04:30 init.py:207] Automatically detected platform cuda.
WARNING 03-21 03:04:32 registry.py:352] Model architecture CosyVoice2Model is already registered, and will be overwritten by the new model class <class 'cosyvoice.llm.vllm_use_cosyvoice2_model.CosyVoice2Model'>.
WARNING 03-21 03:04:32 config.py:2448] Casting torch.bfloat16 to torch.float16.
INFO 03-21 03:04:32 config.py:549] This model supports multiple tasks: {'score', 'embed', 'classify', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 03-21 03:04:32 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=1024.
WARNING 03-21 03:04:32 utils.py:2128] CUDA was previously initialized. We must use the spawn multiprocessing start method. Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing for more information.
2025-03-21 03:04:33,837 - modelscope - INFO - PyTorch version 2.5.1 Found.
2025-03-21 03:04:33,838 - modelscope - INFO - Loading ast index from /home/oppoer/.cache/modelscope/ast_indexer
2025-03-21 03:04:33,859 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 c9bd43650d2dc196e6e1b33b56c12959 and a total number of 980 components indexed
failed to import ttsfrd, use WeTextProcessing instead
INFO 03-21 03:04:37 init.py:207] Automatically detected platform cuda.
INFO 03-21 03:04:38 core.py:50] Initializing a V1 LLM engine (v0.7.3) with config: model='/home/oppoer/.cache/modelscope/hub/iic/CosyVoice2-0___5B', speculative_config=None, tokenizer='/home/oppoer/.cache/modelscope/hub/iic/CosyVoice2-0___5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/oppoer/.cache/modelscope/hub/iic/CosyVoice2-0___5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 03-21 03:04:38 utils.py:2262] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f239070ebc0>
INFO 03-21 03:04:38 gpu_model_runner.py:1049] Starting to load model /home/oppoer/.cache/modelscope/hub/iic/CosyVoice2-0___5B...
INFO 03-21 03:04:38 cuda.py:157] Using Flash Attention backend on V1 engine.
WARNING 03-21 03:04:38 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 03-21 03:04:38 rejection_sampler.py:47] FlashInfer is not available. Falling back to the PyTorch-native implementation of rejection sampling. For the best performance, please install FlashInfer.
/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/torch/utils/device.py:106: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad(True), rather than torch.tensor(sourceTensor).
return func(*args, **kwargs)
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.27s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.27s/it]

INFO 03-21 03:04:40 gpu_model_runner.py:1060] Loading model weights took 0.9532 GB
INFO 03-21 03:04:44 backends.py:408] Using cache directory: /home/oppoer/.cache/vllm/torch_compile_cache/032e9e3730/rank_0 for vLLM's torch.compile
INFO 03-21 03:04:44 backends.py:418] Dynamo bytecode transform time: 3.55 s
INFO 03-21 03:04:44 backends.py:115] Directly load the compiled graph for shape None from the cache
INFO 03-21 03:04:46 monitor.py:33] torch.compile takes 3.55 s in total
INFO 03-21 03:04:46 kv_cache_utils.py:522] # GPU blocks: 80354
INFO 03-21 03:04:46 kv_cache_utils.py:525] Maximum concurrency for 1024 tokens per request: 1255.53x
INFO 03-21 03:05:02 gpu_model_runner.py:1339] Graph capturing finished in 15 secs, took 1.26 GiB
INFO 03-21 03:05:02 core.py:116] init engine (profile, create kv cache, warmup model) took 22.00 seconds
2025-03-21 03:05:02,324 DEBUG Using selector: EpollSelector
[03/21/2025-03:05:04] [TRT] [I] Loaded engine size: 159 MiB
[03/21/2025-03:05:04] [TRT] [I] [MS] Running engine with multi stream info
[03/21/2025-03:05:04] [TRT] [I] [MS] Number of aux streams is 1
[03/21/2025-03:05:04] [TRT] [I] [MS] Number of total worker streams is 2
[03/21/2025-03:05:04] [TRT] [I] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[03/21/2025-03:05:05] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +4545, now: CPU 0, GPU 4681 (MiB)
INFO: Uvicorn running on http://0.0.0.0:50001 (Press CTRL+C to quit)
INFO: Started parent process [261530]
2025-03-21 03:05:07,088 - modelscope - INFO - PyTorch version 2.5.1 Found.
2025-03-21 03:05:07,088 - modelscope - INFO - Loading ast index from /home/oppoer/.cache/modelscope/ast_indexer
2025-03-21 03:05:07,088 - modelscope - INFO - PyTorch version 2.5.1 Found.
2025-03-21 03:05:07,089 - modelscope - INFO - Loading ast index from /home/oppoer/.cache/modelscope/ast_indexer
2025-03-21 03:05:07,112 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 c9bd43650d2dc196e6e1b33b56c12959 and a total number of 980 components indexed
2025-03-21 03:05:07,112 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 c9bd43650d2dc196e6e1b33b56c12959 and a total number of 980 components indexed
failed to import ttsfrd, use WeTextProcessing instead
failed to import ttsfrd, use WeTextProcessing instead
INFO: Started server process [262208]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Started server process [262207]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: 10.233.185.41:34732 - "GET /inference_test HTTP/1.0" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in call
return await self.app(scope, receive, send)
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/applications.py", line 113, in call
await self.middleware_stack(scope, receive, send)
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in call
raise exc
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in call
await self.app(scope, receive, _send)
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/routing.py", line 715, in call
await self.middleware_stack(scope, receive, send)
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
await route.handle(scope, receive, send)
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
await self.app(scope, receive, send)
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
response = await f(request)
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
raw_response = await run_endpoint_function(
File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
return await dependant.call(**values)
File "/home/notebook/data/personal/tts/demo/dev-Comet-CosyVoice/runtime/python/fastapi/server.py", line 72, in inference_test
model_output = cosyvoice.inference_sft(tts_text, spk_id)
NameError: name 'cosyvoice' is not defined. Did you mean: 'CosyVoice'?

2. Another question: during inference the LLM is accelerated with vLLM and supports concurrency, but the downstream flow model doesn't seem to. Could several flow instances be created to consume the LLM outputs that vLLM produces concurrently, so the overall concurrency of the pipeline goes up? I'm not very familiar with async and concurrency; if the idea is workable, do you plan to release a further optimized version? The vLLM speed-up is already very noticeable. Thanks again for your contribution to the open-source community!

@wang-TJ-20

(quoting @jnkr36's questions and logs above)

Hi, you mentioned that vLLM supports concurrency. In my tests with a single vLLM instance, latency increases under concurrent requests; could you share how you are using it?

@jnkr36

jnkr36 commented Apr 3, 2025

(quoting @jnkr36's questions and logs above)
INFO 03-21 03:04:40 gpu_model_runner.py:1060] Loading model weights took 0.9532 GB INFO 03-21 03:04:44 backends.py:408] Using cache directory: /home/oppoer/.cache/vllm/torch_compile_cache/032e9e3730/rank_0 for vLLM's torch.compile INFO 03-21 03:04:44 backends.py:418] Dynamo bytecode transform time: 3.55 s INFO 03-21 03:04:44 backends.py:115] Directly load the compiled graph for shape None from the cache INFO 03-21 03:04:46 monitor.py:33] torch.compile takes 3.55 s in total INFO 03-21 03:04:46 kv_cache_utils.py:522] # GPU blocks: 80354 INFO 03-21 03:04:46 kv_cache_utils.py:525] Maximum concurrency for 1024 tokens per request: 1255.53x INFO 03-21 03:05:02 gpu_model_runner.py:1339] Graph capturing finished in 15 secs, took 1.26 GiB INFO 03-21 03:05:02 core.py:116] init engine (profile, create kv cache, warmup model) took 22.00 seconds 2025-03-21 03:05:02,324 DEBUG Using selector: EpollSelector [03/21/2025-03:05:04] [TRT] [I] Loaded engine size: 159 MiB [03/21/2025-03:05:04] [TRT] [I] [MS] Running engine with multi stream info [03/21/2025-03:05:04] [TRT] [I] [MS] Number of aux streams is 1 [03/21/2025-03:05:04] [TRT] [I] [MS] Number of total worker streams is 2 [03/21/2025-03:05:04] [TRT] [I] [MS] The main stream provided by execute/enqueue calls is the first worker stream [03/21/2025-03:05:05] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +4545, now: CPU 0, GPU 4681 (MiB) INFO: Uvicorn running on http://0.0.0.0:50001 (Press CTRL+C to quit) INFO: Started parent process [261530] 2025-03-21 03:05:07,088 - modelscope - INFO - PyTorch version 2.5.1 Found. 2025-03-21 03:05:07,088 - modelscope - INFO - Loading ast index from /home/oppoer/.cache/modelscope/ast_indexer 2025-03-21 03:05:07,088 - modelscope - INFO - PyTorch version 2.5.1 Found. 2025-03-21 03:05:07,089 - modelscope - INFO - Loading ast index from /home/oppoer/.cache/modelscope/ast_indexer 2025-03-21 03:05:07,112 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 c9bd43650d2dc196e6e1b33b56c12959 and a total number of 980 components indexed 2025-03-21 03:05:07,112 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 c9bd43650d2dc196e6e1b33b56c12959 and a total number of 980 components indexed failed to import ttsfrd, use WeTextProcessing instead failed to import ttsfrd, use WeTextProcessing instead INFO: Started server process [262208] INFO: Waiting for application startup. INFO: Application startup complete. INFO: Started server process [262207] INFO: Waiting for application startup. INFO: Application startup complete. 
INFO: 10.233.185.41:34732 - "GET /inference_test HTTP/1.0" 500 Internal Server Error ERROR: Exception in ASGI application Traceback (most recent call last): File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi result = await app( # type: ignore[func-returns-value] File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in call return await self.app(scope, receive, send) File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in call await super().call(scope, receive, send) File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/applications.py", line 113, in call await self.middleware_stack(scope, receive, send) File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in call raise exc File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in call await self.app(scope, receive, _send) File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in call await self.app(scope, receive, send) File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in call await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send) File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app raise exc File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app await app(scope, receive, sender) File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/routing.py", line 715, in call await self.middleware_stack(scope, receive, send) File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/routing.py", line 735, in app await route.handle(scope, receive, send) File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle await self.app(scope, receive, send) File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/routing.py", line 76, in app await wrap_app_handling_exceptions(app, request)(scope, receive, send) File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app raise exc File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app await app(scope, receive, sender) File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/starlette/routing.py", line 73, in app response = await f(request) File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app raw_response = await run_endpoint_function( File "/opt/conda/envs/cosyvoice/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function return await dependant.call(**values) File "/home/notebook/data/personal/tts/demo/dev-Comet-CosyVoice/runtime/python/fastapi/server.py", line 72, in inference_test model_output = cosyvoice.inference_sft(tts_text, spk_id) NameError: name 'cosyvoice' is not defined. Did you mean: 'CosyVoice'?
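A minimal sketch of what per-worker initialization could look like, assuming the model is currently created at module level or under `if __name__ == '__main__'` (which only runs in the parent process, hence the `NameError` in the log above). The path, endpoint body and arguments are placeholders, not the actual server.py:

```python
# Minimal sketch (assumption, not a verified fix): build the model inside each
# uvicorn worker via FastAPI's lifespan hook, instead of at module import time
# or inside `if __name__ == "__main__":`, which only runs in the parent process.
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs once in every worker process, so each worker gets its own model.
    from cosyvoice.cli.cosyvoice import CosyVoice2
    app.state.cosyvoice = CosyVoice2(
        "pretrained_models/CosyVoice2-0.5B",   # model path is an assumption
        load_jit=False, load_trt=True, fp16=True, use_vllm=True,
    )
    yield  # shutdown logic could go after the yield


app = FastAPI(lifespan=lifespan)


@app.get("/inference_test")
def inference_test(tts_text: str, spk_id: str):
    # Plain `def` (not `async def`) so FastAPI runs this blocking call in its
    # threadpool instead of blocking the event loop.
    cosyvoice = app.state.cosyvoice
    # Placeholder body: the real server streams audio; here we just drain the
    # generator to show the model is reachable inside the worker.
    n_chunks = sum(1 for _ in cosyvoice.inference_sft(tts_text, spk_id))
    return {"chunks": n_chunks}
```

Note that each worker then loads its own vLLM engine and TensorRT context, so GPU memory scales with the worker count, and whether several engines coexist cleanly on one GPU is not verified here.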
2. Another question: during inference the LLM part is accelerated with vLLM and does support concurrency, but the downstream flow models do not seem to. Would it work to instantiate several flow models and use them to process the results coming out of vLLM concurrently? Could that raise the concurrency of the whole pipeline a bit more? I do not know much about async and concurrency; if the idea is feasible, do you plan to release a further optimized version? The speedup from vLLM is already very noticeable. Thanks again for your contribution to the open-source community ~~
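On question 2, a minimal sketch of the "several flow instances" idea, with the caveat that `load_flow_model` and `run_flow` below are placeholder callables standing in for whatever the real flow entry points are (not the actual CosyVoice API):

```python
import asyncio
from typing import Any, Callable


class FlowPool:
    """Hypothetical pool of flow-model instances shared by concurrent requests."""

    def __init__(self, load_flow_model: Callable[[], Any], num_instances: int = 2):
        self._idle: asyncio.Queue[Any] = asyncio.Queue()
        for _ in range(num_instances):
            self._idle.put_nowait(load_flow_model())   # each instance owns its own state

    async def run(self, run_flow: Callable[..., Any], *args, **kwargs):
        flow = await self._idle.get()                  # wait until an instance is free
        try:
            # Run the blocking flow inference in a worker thread so the event
            # loop (and vLLM's async engine) keeps serving other requests.
            return await asyncio.to_thread(run_flow, flow, *args, **kwargs)
        finally:
            self._idle.put_nowait(flow)                # hand the instance back
```

Even with two instances the kernels still share one GPU, so the realistic gain is removing the single-instance queueing rather than a clean 2x; it would have to be measured against the extra memory for the duplicated flow weights.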

Hi, I saw you mention that vLLM supports concurrency. In my tests with a single vLLM instance, latency goes up under concurrent requests. Could you share how you are using it?

1. First, the original version without vLLM basically does not support concurrency: if a single request takes about 5 s, sending 5 concurrent requests takes roughly 25 s. See below:
1)ab -n 1 -c 1 10.77.16.155:50000/test

This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.77.16.155 (be patient).....done

Server Software: uvicorn
Server Hostname: 10.77.16.155
Server Port: 50000

Document Path: /test
Document Length: 493454 bytes

Concurrency Level: 1
Time taken for tests: 4.533 seconds
Complete requests: 1
Failed requests: 0
Total transferred: 493555 bytes
HTML transferred: 493454 bytes
Requests per second: 0.22 [#/sec] (mean)
Time per request: 4532.984 [ms] (mean)
Time per request: 4532.984 [ms] (mean, across all concurrent requests)
Transfer rate: 106.33 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 3 3 0.0 3 3
Processing: 4530 4530 0.0 4530 4530
Waiting: 4 4 0.0 4 4
Total: 4533 4533 0.0 4533 4533

2)ab -n 5 -c 5 10.77.16.155:50000/test
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.77.16.155 (be patient).....done

Server Software: uvicorn
Server Hostname: 10.77.16.155
Server Port: 50000

Document Path: /test
Document Length: 437774 bytes

Concurrency Level: 5
Time taken for tests: 25.639 seconds
Complete requests: 5
Failed requests: 4
(Connect: 0, Receive: 0, Length: 4, Exceptions: 0)
Total transferred: 2429375 bytes
HTML transferred: 2428870 bytes
Requests per second: 0.20 [#/sec] (mean)
Time per request: 25638.886 [ms] (mean)
Time per request: 5127.777 [ms] (mean, across all concurrent requests)
Transfer rate: 92.53 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 3 3 0.0 3 3
Processing: 22839 24178 1151.0 24518 25630
Waiting: 4 9 6.9 8 21
Total: 22841 24181 1151.1 24521 25632

Percentage of the requests served within a certain time (ms)
50% 24001
66% 25041
75% 25041
80% 25632
90% 25632
95% 25632
98% 25632
99% 25632
100% 25632 (longest request)

2. With the vLLM version, vLLM accelerates the Qwen LLM inside the pipeline (which is also the dominant cost of inference), while the flow and later stages are left untouched. The overall speedup is still very noticeable: a single request drops from 4-5 s to a bit over 1 s. When 5 requests are sent at once, the server-side processing time does not grow linearly either, so vLLM's concurrency is clearly working, but it is still somewhat longer than a single request. That is probably because the flow model has no parallel capability: after vLLM finishes, requests have to queue up and go through flow one by one, which adds some time. (A small client for repeating this kind of concurrency test is sketched after the ab results below.)

3)ab -n 1 -c 1 10.77.16.155:50000/test
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.77.16.155 (be patient).....done

Server Software: uvicorn
Server Hostname: 10.77.16.155
Server Port: 50000

Document Path: /test
Document Length: 86462 bytes

Concurrency Level: 1
Time taken for tests: 1.385 seconds
Complete requests: 1
Failed requests: 0
Total transferred: 86563 bytes
HTML transferred: 86462 bytes
Requests per second: 0.72 [#/sec] (mean)
Time per request: 1384.836 [ms] (mean)
Time per request: 1384.836 [ms] (mean, across all concurrent requests)
Transfer rate: 61.04 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 3 3 0.0 3 3
Processing: 1382 1382 0.0 1382 1382
Waiting: 3 3 0.0 3 3
Total: 1385 1385 0.0 1385 1385

4)ab -n 5 -c 5 10.77.16.155:50000/test

This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.77.16.155 (be patient).....done

Server Software: uvicorn
Server Hostname: 10.77.16.155
Server Port: 50000

Document Path: /test
Document Length: 88536 bytes

Concurrency Level: 5
Time taken for tests: 2.510 seconds
Complete requests: 5
Failed requests: 4
(Connect: 0, Receive: 0, Length: 4, Exceptions: 0)
Total transferred: 430185 bytes
HTML transferred: 429680 bytes
Requests per second: 1.99 [#/sec] (mean)
Time per request: 2510.320 [ms] (mean)
Time per request: 502.064 [ms] (mean, across all concurrent requests)
Transfer rate: 167.35 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 3 3 0.1 3 3
Processing: 2047 2327 237.4 2500 2501
Waiting: 4 10 7.4 8 23
Total: 2050 2330 237.3 2503 2503

Percentage of the requests served within a certain time (ms)
50% 2503
66% 2503
75% 2503
80% 2503
90% 2503
95% 2503
98% 2503
99% 2503
100% 2503 (longest request)
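For what it's worth, the same 1-vs-5 comparison can be reproduced with a tiny standard-library client instead of ab (the address below is just the placeholder from the runs above; point it at your own /test endpoint):

```python
# Minimal concurrency probe using only the standard library.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://10.77.16.155:50000/test"   # placeholder address from the ab runs


def one_request(_: int) -> float:
    t0 = time.time()
    with urllib.request.urlopen(URL) as resp:
        resp.read()                       # drain the (audio) body
    return time.time() - t0


for concurrency in (1, 5):
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(one_request, range(concurrency)))
    print(f"c={concurrency}: wall={time.time() - t0:.2f}s "
          f"per-request={[f'{x:.2f}' for x in latencies]}")
```

The per-request latencies make it easy to see whether the extra time under concurrency comes from queueing in flow or from the LLM stage itself.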

@hjj-lmx
Copy link

hjj-lmx commented Apr 8, 2025

Has this code been merged into the main branch yet? If not, on which branch can I find it?

@hjj-lmx
Copy link

hjj-lmx commented Apr 8, 2025

dev/Comet simply does not run for me.

@russell-shu
Copy link

Hi, I could not get this branch to run.
I created a fresh conda environment and installed from requirements_vllm.txt; the system environment is WSL Ubuntu 24.04.
The error is as follows (see the note on the V0/V1 engine mismatch after the traceback):
/home/russell/anaconda3/bin/conda run -n vllm --no-capture-output python /mnt/d/TTS/CosyVoice_vllm/CosyVoice/test.py
2025-04-22 17:29:21,778 - modelscope - INFO - PyTorch version 2.5.1 Found.
2025-04-22 17:29:21,783 - modelscope - INFO - Loading ast index from /home/russell/.cache/modelscope/ast_indexer
2025-04-22 17:29:22,006 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 5f3a4ed3862dc5e337237022776d03e3 and a total number of 980 components indexed
failed to import ttsfrd, use WeTextProcessing instead
Sliding Window Attention is enabled but not implemented for sdpa; unexpected results may be encountered.
/home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/diffusers/models/lora.py:393: FutureWarning: LoRACompatibleLinear is deprecated and will be removed in version 1.0.0. Use of LoRACompatibleLinear is deprecated. Please switch to PEFT backend by installing PEFT: pip install peft.
deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
2025-04-22 17:29:42,613 INFO input frame rate=25
2025-04-22 17:29:45.999075180 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 8 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-04-22 17:29:46.002326607 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-04-22 17:29:46.002361559 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2025-04-22 17:29:46,388 WETEXT INFO found existing fst: /home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/tn/zh_tn_tagger.fst
2025-04-22 17:29:46,388 INFO found existing fst: /home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/tn/zh_tn_tagger.fst
2025-04-22 17:29:46,388 WETEXT INFO /home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/tn/zh_tn_verbalizer.fst
2025-04-22 17:29:46,388 INFO /home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/tn/zh_tn_verbalizer.fst
2025-04-22 17:29:46,388 WETEXT INFO skip building fst for zh_normalizer ...
2025-04-22 17:29:46,388 INFO skip building fst for zh_normalizer ...
2025-04-22 17:29:46,739 WETEXT INFO found existing fst: /home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/tn/en_tn_tagger.fst
2025-04-22 17:29:46,739 INFO found existing fst: /home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/tn/en_tn_tagger.fst
2025-04-22 17:29:46,739 WETEXT INFO /home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/tn/en_tn_verbalizer.fst
2025-04-22 17:29:46,739 INFO /home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/tn/en_tn_verbalizer.fst
2025-04-22 17:29:46,739 WETEXT INFO skip building fst for en_normalizer ...
2025-04-22 17:29:46,739 INFO skip building fst for en_normalizer ...
INFO 04-22 17:29:47 __init__.py:207] Automatically detected platform cuda.
WARNING 04-22 17:29:49 registry.py:352] Model architecture CosyVoice2Model is already registered, and will be overwritten by the new model class <class 'cosyvoice.llm.vllm_use_cosyvoice2_model.CosyVoice2Model'>.
WARNING 04-22 17:29:49 config.py:2448] Casting torch.bfloat16 to torch.float16.
INFO 04-22 17:29:49 config.py:549] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 04-22 17:29:49 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=1024.
INFO 04-22 17:29:49 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='pretrained_models/CosyVoice2-0.5B', speculative_config=None, tokenizer='pretrained_models/CosyVoice2-0.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=pretrained_models/CosyVoice2-0.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}, use_cached_outputs=False,
WARNING 04-22 17:29:49 utils.py:2262] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f4a8a893730>
WARNING 04-22 17:29:51 interface.py:304] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 04-22 17:29:51 gpu_model_runner.py:1049] Starting to load model pretrained_models/CosyVoice2-0.5B...
INFO 04-22 17:29:51 cuda.py:157] Using Flash Attention backend on V1 engine.
WARNING 04-22 17:29:51 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 04-22 17:29:51 rejection_sampler.py:47] FlashInfer is not available. Falling back to the PyTorch-native implementation of rejection sampling. For the best performance, please install FlashInfer.
/home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/device.py:106: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
return func(*args, **kwargs)
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:11<00:00, 11.13s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:11<00:00, 11.13s/it]

INFO 04-22 17:30:02 gpu_model_runner.py:1060] Loading model weights took 0.9532 GB
2025-04-22 17:30:02,817 WARNING use vllm inference failed.

[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/d/TTS/CosyVoice_vllm/CosyVoice/test.py", line 39, in
[rank0]: main()
[rank0]: File "/mnt/d/TTS/CosyVoice_vllm/CosyVoice/test.py", line 34, in main
[rank0]: cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=True, load_trt=True, fp16=True, use_vllm=True)
[rank0]: File "/mnt/d/TTS/CosyVoice_vllm/CosyVoice/cosyvoice/cli/cosyvoice.py", line 174, in init
[rank0]: raise e
[rank0]: File "/mnt/d/TTS/CosyVoice_vllm/CosyVoice/cosyvoice/cli/cosyvoice.py", line 171, in init
[rank0]: self.model = VllmCosyVoice2Model(model_dir, configs['flow'], configs['hift'], fp16)
[rank0]: File "/mnt/d/TTS/CosyVoice_vllm/CosyVoice/cosyvoice/cli/model.py", line 456, in init
[rank0]: llm = VllmQwen2LM(model_dir)
[rank0]: File "/mnt/d/TTS/CosyVoice_vllm/CosyVoice/cosyvoice/llm/llm_vllm.py", line 81, in init
[rank0]: self.llm_engine = AsyncLLMEngine.from_engine_args(engine_args)
[rank0]: File "/home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 644, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 594, in init
[rank0]: self.engine = self._engine_class(*args, **kwargs)
[rank0]: File "/home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 267, in init
[rank0]: super().init(*args, **kwargs)
[rank0]: File "/home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 276, in init
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 421, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 102, in determine_num_available_blocks
[rank0]: results = self.collective_rpc("determine_num_available_blocks")
[rank0]: File "/home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/utils.py", line 2196, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/russell/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 107, in determine_num_available_blocks
[rank0]: raise NotImplementedError
[rank0]: NotImplementedError
[rank0]:[W422 17:30:03.018674493 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
ERROR conda.cli.main_run:execute(125): conda run python /mnt/d/TTS/CosyVoice_vllm/CosyVoice/test.py failed. (See above for error)
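The NotImplementedError comes from a V0 AsyncLLMEngine driving a vllm.v1 GPU worker: the log above says "Initializing a V0 LLM engine (v0.7.3)" but also warns that determine_num_available_blocks is not implemented in vllm.v1.worker.gpu_worker.Worker. Purely as an assumption to try rather than a confirmed fix, the engine version can be pinned explicitly before anything imports vllm, via the VLLM_USE_V1 environment variable that vLLM 0.7.x reads at import time (treat the variable name as an assumption and check it against your installed version):

```python
# Assumption, not a verified fix: make the engine and worker agree on V0 vs V1
# by setting VLLM_USE_V1 before vllm (or anything that imports it) is loaded.
import os

os.environ["VLLM_USE_V1"] = "1"   # or "0" to force the legacy V0 engine + worker

from cosyvoice.cli.cosyvoice import CosyVoice2  # import only after the env var is set
```

If that does not help, the WSL environment itself may be a factor; the log already warns that pin_memory is disabled under WSL.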
